EVALUATION


A study was conducted to determine the effects of g-stress on a
speaker's voice patterns and the subsequent effect on word recognition
accuracy. Data were obtained from nine subjects at g-levels of 1G, 3G,
5G and 7G. All the subjects wore a face mask and, seated in a seat set
at a 13° angle, made several repetitions of the digits (except at 7G).
Algorithms were developed to compensate for the major problems caused
by breath noise and by the change in voice characteristics caused by
the face mask.


ROBERT  A.  CURTIS 
Captain,  USAF 
Project  Engineer 


1. INTRODUCTION


1.1 BACKGROUND

Considerable  progress  has  been  made  in  recent  years  in  the  field 
of  automatic  recognition  of  human  speech.  The  point  has  been 
reached  where  isolated  word  recognition  devices  are  feasible  for 
a number  of  applications  both  commercial  and  military.  However, 
these  automatic  speech  recognition  equipments  generally  require 
a benign  operating  environment,  where  the  signal-to-noise  ratio 
is  good  (greater  than  20  dB)  and  where  the  transmission  path  is 
subject  to  complete  control.  There  are  military  applications 
where  the  environment  is  not  so  benign,  but  where  voice  recogni- 
tion capability  could  be  effectively  utilized  if  available. 

One  such  application  is  in  the  cockpit  of  a fighter  aircraft, 
where  the  requirement  exists  for  a voice  command  system  for  a 
pilot to make his mission more effective. A number of aspects
of the cockpit environment make it difficult for a voice recognition
system to operate. One of these that has long been considered to be
a major obstacle to voice command in the cockpit is the g-force
experienced  by  the  pilot  during  aircraft  maneuvers.  The  objec- 
tive of  the  research  reported  here  was  to  determine  the  effects 
of  g-force  stress  on  the  pilots'  speech  characteristics  so  that 
these  effects  can  be  taken  into  account  in  the  design  of  automa- 
tic speech  recognition  equipment  for  the  airborne  environment. 

A major  effect  of  g-force  stress  on  the  human  body  is  the  tenden- 
cy to  force  blood  away  from  the  brain  causing  blackout,  which  is 
tantamount  to  fainting.  Most  subjects  can  withstand  g-force 
stress  up  to  about  3g  when  seated  in  an  upright  position  without 
effort  or  risk  of  blackout.  At  higher  stress  levels  the  subject 
must work to avoid blackout by tightening the muscles on his
chest and diaphragm to constrict the blood vessels there. This
tends to prevent the blood drain from the head. The subject may
become winded as a result of this effort. At any rate it is not
conducive  to  maintaining  consistent  speech  patterns. 


1.2  SUMMARY  OF  METHODS,  RESULTS  AND  CONCLUSIONS 

For  this  program  a data  base  was  collected  on  the  human  centri- 
fuge at  Brooks  Air  Force  Base.  A voice  recognition  device,  SEI's 
VDETS, was used to prompt the subjects and to provide an on-line
test of recognition capability. Data from nine subjects at 1g,
3g, 5g, and 7g were collected. Good quality audio recordings
were  made. 

Subsequently  the  recordings  were  processed  through  SEI's  VDETS 
and  the  raw  spectral  data,  normally  collected  by  that  device  and 
used  for  further  processing,  were  transferred  to  and  stored  in 
SEI's  DEC-10  computer  system.  Routines  to  emulate  the  VDETS  rec- 
ognition algorithm  and  a number  of  variations  on  it,  as  well  as 
routines  to  preprocess  the  data  and  to  analyze  it  in  various  ways 
were  developed  on  the  DEC-10  and  applied  to  the  data  base. 

It  was  found  that  recognition  performance  on  the  centrifuge  data 
was  significantly  poorer  than  performance  on  the  same  vocabulary 
but  under  normal  conditions  with  the  microphone  supplied  with  the 
VDETS  system.  Performance  decreased  with  increasing  g-force 
stress, but was comparatively poor even at the 1g level for most
of  the  subjects.  Substantial  improvements  were  obtained  by  vari- 
ous modifications  to  the  basic  recognition  algorithm  but  a fully 
satisfactory  solution  was  not  found. 

Changes  in  voice  patterns  with  g-force  stress  were  found,  but 
there  was  no  consistent  pattern  to  these  changes.  The  underlying 
physical  causes  of  the  effects  were  not  definitely  established. 
However,  there  is  evidence  to  support  the  conclusion  that  diffi- 
culties in  recognizing  the  g-force  stressed  word  patterns  were 
attributable  to: 

• Breathlessness on the part of the subject as a result of
the  efforts  required  to  avoid  blackout. 

• Variable  modifications  of  the  acoustic  transmission  path 
caused  by  changes  in  the  cavity  formed  by  the  face  mask 
around  the  mouth  and  nose  of  the  subject. 

2. TECHNICAL DISCUSSION


2.1  DATA  COLLECTION 

The  principal  data  for  the  program  were  collected  at  the  Human 
Centrifuge  facility  at  Brooks  Air  Force  Base,  San  Antonio,  Texas. 
All  of  the  subjects  were  from  the  regular  pool  for  centrifuge  ex- 
periments. In  all,  nine  subjects  each  provided  one  series  of 
runs.  There  were  seven  subjects  dedicated  to  this  program  and 
two  additional  subjects  who  were  involved  in  a different  program, 
but  were  able  to  provide  some  voice  data  at  the  same  time. 

For  the  subjects  dedicated  to  the  program  runs  were  made  at  3g, 
5g,  and  7g.  The  seat  in  the  centrifuge  gondola  was  set  at  the 
normal  13°  angle  and  the  subjects  wore  the  RAF  type  PQ  face  mask 
with built-in microphone. The non-dedicated subjects were testing
an experimental helmet and face mask designated MB 5/P, and
of  course,  used  this  for  the  voice  data  collection.  These  sub- 
jects had  to  make  a run  through  which  the  g-levels  were  varied 
to  simulate  a particular  maneuver,  and  they  could  not  provide 
voice  data  at  this  time.  Hence  less  data  was  obtained  from  the 
non-dedicated  subjects. 

The original plan for the data collection effort called for a
vocabulary  of  words  and  phrases  representative  of  those  that 
might  be  useful  for  voice  cockpit  control  functions  under  real 
conditions.  As  it  turned  out,  the  time  available  on  the  centri- 
fuge was  limited  to  one  week,  there  was  not  an  unlimited  supply 
of  subjects,  and  the  time  that  each  subject  can  spend  at  levels 
of  5g  and  above  is  limited.  At  5g  each  subject  is  allowed  30 
seconds  at  a time  and  one  minute  per  day.  In  view  of  these  re- 
strictions it  was  decided  to  limit  the  vocabulary  to  10  words, 
the  digits  zero  to  nine,  on  the  grounds  that  many  samples  of  the 
same  word  under  various  conditions  would  be  more  valuable  than  a 
like number of different utterances. This proved to be a wise
decision.

A block  diagram  of  the  setup  used  for  recording  data  in  the 
human  centrifuge  is  shown  in  Figure  1.  A SCOPE  Electronics  VDETS 
voice  recognition  device  was  used  in  the  experiment.  This  device 
will  be  described  more  fully  later.  One  of  its  components,  a 
self-scan  display,  was  mounted  in  the  gondola  in  front  of  the 
subject.  This  display  provides  16  alphanumeric  characters  that 
can  be  used  to  prompt  the  subject,  indicate  recognition  results, 
or  provide  any  other  type  of  message  that  can  be  contained  in  the 
16  character  format. 

The audio from the face-mask microphone was fed through a preamplifier,
part of the normal gondola instrumentation, to the input
of  the  VDETS  contained  in  the  same  box  as  the  self-scan  display. 
The  VDETS  channel  provides  gain  and  an  adjustable  attenuator. 

The  VDETS  output  was  fed  through  a coax  channel  via  slip  rings  to 
the  main  VDETS  processor.  A parallel  signal  was  provided  to  the 
right  channel  of  a Wollensak  Type  6250  tape  recorder.  The  voice 
output  of  the  gondola  preamplifier  was  also  transmitted  through 
slip  rings  to  the  normal  audio  channels  used  with  the  centrifuge 
system.  These  channels  provide  mixing  of  the  audio  from  the  gon- 
dola with  the  audio  from  a microphone  in  the  control  room  used  to 
instruct  the  subject.  The  combined  audio  was  recorded  on  an  in- 
strumentation recorder  and  also  the  left  channel  of  the  Wollensak 
Recorder.  Previous  experience  had  indicated  a source  of  distor- 
tion in  the  amplifiers  normally  used  in  the  centrifuge  system. 
Also  it  was  desirable  not  to  have  the  control  room  audio  on  the 
voice  data.  Some  care  was  taken  to  separate  the  channels  as 
shown  in  Figure  1.  The  distortion  problem  was  cleared  up,  but 
there  was  still  some  feedthrough  from  the  control  room  mike  to 
the  voice  data  in  the  right  channel  of  the  Wollensak.  This  was 
solved  by  keeping  the  control  room  mike  off  during  test  runs. 


[Figure 1: block diagram of the recording setup; surviving labels: GONDOLA, CONTROL ROOM]

The  VDETS  self-scan  was  used  to  prompt  the  subject  and  to  indi- 
cate recognition  results  during  the  test.  The  next  word  to  be 
spoken  was  displayed  on  the  self-scan.  After  the  subject  had 
spoken  that  word,  a "C"  or  an  "X"  was  displayed  to  indicate  cor- 
rect or  incorrect  recognition.  Use  of  the  VDETS  in  the  data- 
taking  provided  some  feedback  to  the  subject  and  perhaps  added 
some  interest  to  the  experiment  from  his  standpoint.  The  major 
function  it  provided,  however,  was  to  pace  the  rate  of  speaking. 

In  some  previous  experiments,  where  the  subject  had  merely  been 
instructed to repeat the digits over and over, the rate of speaking
turned out much too fast for an isolated word recognition
system  to  follow. 

The  subjects  used  in  the  experiment  had  no  previous  experience 
with  word  recognition  devices  and  had  little  chance  to  become 
familiar  with  the  VDETS  system  during  the  course  of  the  experi- 
ment. The  procedure  in  most  cases  was  as  follows.  The  subject 
was  given  a brief  explanation  of  the  VDETS  system.  The  training 
and  recognition  functions  were  explained.  He  was  then  given  one 
practice  run  consisting  of  five  training  passes  and  several 
passes  through  the  word  list  to  familiarize  himself  with  the  dis- 
play and  the  system  operation.  He  then  took  his  position  in  the 
gondola  and  the  door  was  closed.  All  subsequent  passes  were  re- 
corded. These  consisted  of: 


• Five and sometimes ten training passes with the words
repeated in order at 1g acceleration (centrifuge stationary).

• Eight to ten passes through the word list at 1g. For
these  passes  the  subject  was  prompted  and  the  word  order 
was  varied  through  ten  different  sequences. 

• Eight  to  ten  passes  at  3g  acceleration. 

• Two  runs  at  5g  acceleration  of  approximately  30  seconds 
duration  each.  This  usually  provided  time  for  four 
passes  through  the  word  list  each  time. 

• Eight to ten passes at 1g immediately following the 5g
runs. In most cases the subject was somewhat winded at
the start of these passes.

• A 7g run with the subject repeating as many words as
possible. About one pass through the word list was
average for the 7g runs.

A summary of the data collected at the Brooks Human Centrifuge
Facility is shown in Table I.

[TABLE I. SUMMARY OF DATA COLLECTED ON BROOKS AFB HUMAN CENTRIFUGE; table body not recoverable, surviving legend entries: ABORTED, PARTIAL REPETITION, RETRAIN]

2.2  DATA  PROCESSING  FACILITIES 

Facilities  for  processing  the  speech  data  consisted  of  hardware 
and  software.  Existing  SEI  test  equipment  and  computers  were 
used.  No  hardware  was  developed  on  the  program.  A substantial 
amount  of  the  software,  consisting  primarily  of  FORTRAN  programs 
for SEI's DECsystem-10 computer, was developed as part of the
program  effort. 

2.2.1  Hardware 

The  principal  equipment  used  for  processing  of  the  speech  data 
was the development support system for the SEI VDETS voice recognition
equipment. A block diagram of the VDETS DSS is shown in
Figure  2.  A Data  General  NOVA  2 minicomputer  is  the  central  pro- 
cessor for  the  system.  Standard  peripheral  devices  include  a 
teletype  terminal,  a high  speed  paper  tape  reader  and  punch,  a 
cassette tape unit, and a Linc tape unit. The system has a 9600
baud  RS-232  interface  to  the  SEI  DECsystem-10  computer.  This 
interface  was  used  for  transmitting  raw  data  from  the  NOVA  to  the 
DEC-10  for  storage  and  analysis. 

Also interfaced to the NOVA in the VDETS system is the speech
processing  front  end  shown  in  the  block  diagram  of  Figure  3. 

Audio input to the speech processing front end is amplified and
applied to a 16 channel filter bank. The bandpass characteristics
of these filters are shown in Figure 4. The filter outputs
are detected and the detected outputs are filtered in low pass
filters of approximately 25 Hz cutoff frequency. The detected
and filtered outputs are fed through a 16 channel multiplexer to
an 8 bit A/D converter.

[Figure 2. VDETS Development Support System]

[Figure 3: block diagram of the speech processing front end; surviving labels: 16 BANDPASS, CONTROL, SAMPLE COMMAND]

The  speech  front  end  processor  is  interfaced  to  the  computer 
through  a NOVA  general  purpose  interface  board.  Through  this 
interface  commands  can  be  given  to  the  multiplexer  to  control  the 
filter  channel  to  be  sampled  and  to  the  analog-to-digital  conver- 
ter to  begin  conversion.  At  the  end  of  conversion  the  digitized 
output  as  well  as  an  end-of-conversion  indication  is  available 
on  the  NOVA  input  bus. 

2.2.2  VDETS  Software 

Software for the VDETS system includes a proprietary core-resident
operating system, VOICE. VOICE contains the algorithm
for  training  and  recognition  of  designated  vocabularies,  as  well 
as  the  routines  to  service  all  peripheral  devices.  In  addition, 
VOICE  contains  provision  for  user  programming  of  vocabulary  and 
action.  The  actions  may  be  triggered  by  various  events,  such  as 
interrupt  from  an  external  device,  the  end  of  a training  sequence, 
or  the  recognition  of  a specific  word  in  the  vocabulary.  If  no 
specific  action  is  associated  with  a vocabulary  word,  then  a gen- 
eral default  action  is  triggered  at  the  completion  of  each  word 
recognition  process.  Action  triggers  may  cause  the  execution  of 
routines  written  in  the  special  programming  language  of  the  sys- 
tem. The  commands  available  in  the  programming  language  permit 
retrieval  of  the  word  recognized,  simple  arithmetic  and  logical 
functions,  as  well  as  control  of  the  peripheral  devices. 

As  a simple  example  of  a voice  program,  consider  the  one  used  for 
the  Brooks  Air  Force  Base  data  collection  effort.  In  this  case, 
the vocabulary was the digits arranged in order "one, two, ...,
zero." The train action was controlled by command from the console,
i.e. typing the command "TRAIN" causes the system to clear
out  previously  stored  reference  patterns  and  initiate  a new 
training  sequence.  To  initiate  training,  the  system  displays 
the  first  vocabulary  word  on  the  self-scan  device.  On  detecting 
an  utterance,  the  system  processes  it  and  stores  the  result  as 
the  beginning  of  the  reference  pattern  for  the  first  word.  It 
then  displays  the  next  word  in  the  vocabulary  and  continues  until 
a number  of  passes  through  the  vocabulary  have  been  completed. 

The  number  of  training  passes  is  under  programmer  control;  in 
this  case  it  was  set  to  five.  At  the  end  of  the  training  se- 
quence the  system  displays  "END  TRAIN"  on  the  self-scan  display. 

The  command  "UPDT"  will  initiate  a one-pass  update  to  the  train- 
ing process.  In  the  updating  process,  data  from  the  new  samples 
are combined with that already stored to modify the reference
patterns.  The  command  "RETRAIN"  will  cause  the  system  to  ask  for 
the  word  number  and  then  to  initiate  a new  training  sequence  of 
five  passes  on  that  word  only.  In  this  case,  the  old  reference 
pattern  is  cleared  and  a new  one  generated. 

Following  the  end-of-train  the  system  goes  automatically  into 
the  recognition  mode.  In  the  recognition  mode  the  system  can 
operate either in a prompt mode or a no-prompt mode, set by entering
"PRM" or "NPRM" on the console. The default mode is no-prompt.
The system can also be put into a stop or go condition
by  typing  "STOP"  or  "START"  on  the  console.  The  default  con- 
dition is  "START." 

When  the  system  detects  an  utterance,  it  automatically  goes 
through  the  recognition  process  and  finds  the  word  most  closely 
matching  this  utterance  in  its  library  of  reference  patterns. 
Following  this  the  default  action  is  triggered.  If  the  system 
is  in  the  "GO"  condition  and  the  no-prompt  mode,  then  triggering 
the  default  action  causes  the  word  recognized  to  be  displayed  on 
the  self-scan. 

In the prompt mode the system displays the next word expected on
the  self-scan  display.  Under  the  default  action  triggered  by 
the  next  utterance,  it  compares  the  word  recognized  with  the 
word  expected  and  displays  either  a "C"  or  "X"  depending  on 
whether  or  not  the  recognized  word  matched  the  expected  one.  It 
then  updates  the  expected  word  and  displays  the  next  one  on  the 
self-scan.  The  order  in  which  words  are  prompted  goes  through 
a cycle of ten passes as follows: On the first pass the words
are presented in order one through ten. The second pass begins
with  two  and  goes  in  steps  of  three,  etc.  The  sequence  can  be 
started  at  the  beginning  of  any  of  ten  orders  by  typing  the 
command  "I."  The  system  then  asks  for  the  first  word  and  sets 
the  prompt  sequence  for  that  point. 

2.2.3  VDETS  Raw  Data  Processing 

As  described  previously,  the  VDETS  front  end  contains  a spectrum 
analyzer  whose  output  can  be  sampled  under  the  control  of  the 
central  processor.  Normally  the  sampling  process  goes  on  con- 
tinuously at  a rate  of  100  samples  per  second.  At  each  sample 
point all 16 filters are sampled resulting in 16 8-bit numbers
which  are  input  to  the  computer.  If  the  system  is  in  an  idle 
condition,  i.e.  no  word  boundary  has  been  detected,  then  a word 
boundary  test  is  applied  to  each  new  sample.  The  absolute  value 
of  the  spectral  difference  between  the  new  sample  and  the  preced- 
ing sample  is  computed  by  summing  the  absolute  values  of  the 
difference  in  the  filter  outputs  over  the  16  filters  from  one 
sample  to  the  next.  If  the  sum  exceeds  a certain  threshold  then 
the  sample  is  retained  and  stored  in  a buffer.  Then  the  word 
boundary  flag  is  set.  Subsequent  samples  are  tested  in  the  same 
way  and  either  stored  or  rejected.  (A  different  threshold  can 
be  used  for  subsequent  samples  although  in  most  cases  it  is  kept 
the same.) When a total of 16 samples in a row have been rejected,
the word is considered terminated and the word boundary
flag reset. If the total number of samples retained by this
process is less than some threshold, then the buffer is cleared
and the utterance ignored. Otherwise the system proceeds to
process the word just received.
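
The word boundary test described above can be sketched in a few lines of Python. This is an illustrative reconstruction only, not the VOICE implementation (which the report says is written in NOVA assembly language). The function names, the two threshold arguments, and the min_samples parameter are assumptions introduced here; the report does not give the threshold values.

```python
from typing import List, Optional, Sequence

END_OF_WORD_REJECTS = 16  # consecutive rejected samples that terminate a word


def spectral_difference(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum over the filter channels of |a[k] - b[k]| for two spectral samples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def detect_word(samples: Sequence[Sequence[int]],
                start_threshold: int,
                continue_threshold: Optional[int] = None,
                min_samples: int = 1) -> List[List[int]]:
    """Return the retained samples of the first utterance found.

    An empty list means no utterance was found, or that too few samples
    were retained and the utterance was ignored.
    """
    if continue_threshold is None:
        # The report notes the two thresholds are usually kept the same.
        continue_threshold = start_threshold

    retained: List[List[int]] = []
    in_word = False   # plays the role of the word boundary flag
    rejects = 0       # consecutive rejected samples since the last retained one

    for prev, cur in zip(samples, samples[1:]):
        threshold = continue_threshold if in_word else start_threshold
        if spectral_difference(cur, prev) > threshold:
            retained.append(list(cur))
            in_word = True
            rejects = 0
        elif in_word:
            rejects += 1
            if rejects >= END_OF_WORD_REJECTS:
                break  # word considered terminated

    if len(retained) < min_samples:
        return []  # buffer cleared, utterance ignored
    return retained
```

Note that samples are retained only where the spectrum is changing, so a steady-state sound contributes few samples; the 96-sample buffer limit mentioned in the report is not modeled here.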

The  raw  data  buffer  provides  for  storage  of  up  to  96  samples, 
each  consisting  of  16  8-bit  spectral  energy  numbers.  The  buffer 
can  contain,  therefore,  up  to  1536  characters  or  bytes.  The 
special  version  of  the  VOICE  software  that  supports  the  DEC-10 
communication  interface  also  contains  a special  set  of  actions 
that  can  be  used  in  a user  generated  program  to  initiate  trans- 
mission of  various  types  of  data.  One  such  action  causes  the 
data  in  the  raw  data  buffer  to  be  transmitted  to  the  DEC-10. 
Another  special  action  permits  information  stored  in  user  loca- 
tions, i.e.  under  control  of  the  user  program,  to  be  transmitted. 
Other  actions  control  transmission  of  data  from  other  stages  in 
the  VDETS  processing.  These  actions,  however,  were  not  used  on 
this  program. 

2.2.4  DEC-10  Communications  Software 

At  the  other  end  of  the  data  transmission  path,  DEC-10  routines 
provide  for  handling  speech  data.  The  routines  most  commonly 
used  take  the  received  data  and  pack  the  characters  four  to  a 
word  (the  DEC-10  word  length  is  36  bits)  and  store  the  data  in 
a disk  file.  The  data  can  subsequently  be  transferred  from  disk 
to  magnetic  tape. 
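
As a rough illustration of the packing step, the sketch below stores four 8-bit characters in each 36-bit word. The report does not say where the 32 data bits sit within the word; right-justifying them and leaving the top four bits zero is an assumption made here, as are the function names.

```python
from typing import List

WORD_BITS = 36       # DEC-10 word length
CHARS_PER_WORD = 4   # four 8-bit characters per word, as stated in the report


def pack_words(data: bytes) -> List[int]:
    """Pack bytes four to a word; the last word is zero-padded if needed."""
    words = []
    for i in range(0, len(data), CHARS_PER_WORD):
        chunk = data[i:i + CHARS_PER_WORD].ljust(CHARS_PER_WORD, b"\0")
        # Assumed layout: 32 data bits right-justified, top 4 bits unused.
        words.append(int.from_bytes(chunk, "big"))
    return words


def unpack_words(words: List[int], length: int) -> bytes:
    """Inverse of pack_words; length is the original byte count."""
    raw = b"".join(w.to_bytes(CHARS_PER_WORD, "big") for w in words)
    return raw[:length]


# A full raw data buffer: 96 samples of 16 bytes each.
full_buffer = bytes(96 * 16)
packed = pack_words(full_buffer)
```

Under this layout a full 1536-byte raw data buffer occupies 384 DEC-10 words.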

All  of  the  data  from  the  Brooks  Centrifuge  Experiment  were  pro- 
cessed as  described  above  and  stored  in  files  on  the  DEC-10  sys- 
tem. The  process  of  transcribing  data  from  the  audio  tapes  to 
the  DEC-10  in  this  way  was  somewhat  tedious  because  the  trans- 
mission process  was  not  fast  enough  to  operate  in  real  time.  It 
was  necessary,  therefore,  to  stop  and  start  the  tape  recorder, 
waiting  between  each  word  for  the  transmission  process,  which 
took  approximately  three  to  five  seconds.  It  was  necessary,  of 
course, to supply with each word a label indicating which word
it  is.  This  can  be  done  automatically  through  the  VDETS  user 
program  if  the  words  are  repeated  reliably  in  a known  sequence. 
With  the  Brooks  data  the  order  was  supposedly  known,  but  there 
were  enough  gaps,  repetitions  and  extraneous  noises,  etc.,  to 
make  automatic  labeling  unsatisfactory.  Each  sample,  therefore, 
was  manually  labeled  after  entry  through  a CRT  terminal  connected 
in  to  the  DEC-10  system.  A list  of  the  data  files  processed  is 
shown  in  Table  II. 
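
A rough rate calculation is consistent with the remark that the transfer could not run in real time. The sketch assumes ten line bits per 8-bit character (one start bit and one stop bit), which the report does not state, and it ignores any protocol or software overhead on either machine.

```python
SAMPLE_RATE_HZ = 100       # spectral samples per second (Section 2.2.3)
BYTES_PER_SAMPLE = 16      # one 8-bit value per filter channel
BUFFER_BYTES = 96 * 16     # full raw data buffer
LINK_BAUD = 9600           # NOVA to DEC-10 interface
LINE_BITS_PER_CHAR = 10    # assumption: start bit + 8 data bits + stop bit

# Peak rate at which retained samples accumulate while a word is spoken.
raw_bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE

# Best-case payload rate of the serial link under the framing assumption.
link_bytes_per_second = LINK_BAUD / LINE_BITS_PER_CHAR

# Lower bound on the time to send one full buffer.
full_buffer_seconds = BUFFER_BYTES / link_bytes_per_second
```

Under these assumptions the link carries at most 960 bytes per second against a peak of 1600 bytes per second of retained samples, and a full buffer needs at least 1.6 seconds. The three to five seconds per word reported above is longer than this lower bound, which suggests additional overhead that the report does not itemize.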

2.3 SOFTWARE

Software  used  in  the  study  consists  of 

• the  VDETS  voice  routine  with  several  modifications 

• FORTRAN  routines  to  operate  on  the  DEC-10 

The VOICE operating system and its modifications are written entirely
in NOVA assembly language. Application programs in the VOICE system,
as mentioned in Section 2.2.2, use their own special programming
language. Most of the software effort and most of the
study  were  carried  out  on  the  DEC-10.  The  remainder  of  this 
section  will  be  devoted  to  a description  of  the  DEC-10  speech 
processing  library  developed  for  and  used  on  this  program. 

2.3.1  DEC-10  Library 

The  DEC-10  speech  processing  library  used  on  this  program  con- 
sists of  the  following  types  of  routines: 

• A master  training/recognition  routine 

• Various  routines  for  preprocessing  speech  data  files 

• Routines  for  examining,  editing  and  otherwise  manipulat- 
ing speech  data  files 

• Routines  for  analyzing  data  in  speech  data  files 

• Routines  for  plotting  data 

Annotated  listings  of  these  routines  are  included  under  separate 
cover. 

2.3.2 Training/Recognition Program


RESLT3,  the  final  version  of  the  training/recognition  routine, 
performs  a training  and/or  recognition  experiment  on  a speech 
data  file  in  the  format  developed  for  this  program.  The  basic 
processes  are  patterned  after  the  VDETS  algorithm  and  there  are 
numerous  options  for  the  various  functions  performed. 

As  described  in  Section  2.2.3,  raw  data  for  the  speech  processing 
algorithm  consist  of  a number  of  16-element  spectral  samples, 
each  sample  quantized  to  8 bits.  The  raw  data  are  reduced  to  a 
much  smaller  number  of  bits  by  the  processes  of  segmentation, 
compression  and  coding.  The  output  of  this  process  is  referred 
to  as  the  coded  data  and  is  of  fixed  length  for  all  words. 

The  segmentation  process  divides  the  set  of  raw  data  samples 
comprising  a single  utterance  into  a fixed  number  N,  usually 
eight,  of  subsets  or  segments.  The  compression  process  averages 
the  spectral  samples  in  each  segment  to  produce  N averaged  spec- 
tra. Finally  the  coding  process  reduces  the  N averaged  spectra 
to  the  coded  form  with  a further  reduction  in  the  bit  level. 

An  alternative  procedure  is  to  perform  the  coding  operation 
prior  to  the  compression  operation.  Under  this  option,  the 
coding  process  is  performed  on  each  raw  data  sample  and  the  com- 
pression is  performed  by  averaging  the  coded  data  samples  over 
each  segment  rather  than  averaging  the  spectra.  The  motivation 
for  this  alternative  mode  is  that  it  eliminates  the  need  to  store 
the  raw  data  samples  and  hence  reduces  the  memory  requirements 
for  the  processor. 

The  algorithm  for  segmentation  and  coding  as  well  as  the  number 
of  bits  used  in  the  coded  word  are  not  described  above  because  a 
number  of  options  for  these  processes  are  available.  During  the 
course of the program, three algorithms for segmentation and ten
algorithms for coding were tested. These are all available in
RESLT3 as well as the code-before-compress and code-after-compress
modes discussed in the preceding paragraph. Details of
the  various  processing  modes  are  contained  in  Appendix  A. 

The  training  process  generates  the  reference  patterns  or  tem- 
plates against  which  unknown  words  are  compared  during  the  recog- 
nition process.  The  reference  patterns  are  generated  from  a set 
of  training  samples  consisting  of  one  or  more  repetitions  of  each 
word  of  the  vocabulary. 

RESLT3  provides  a multi-mode  training  process  as  described  in 
Appendix  A under  Reference  Pattern  Distance.  Training  may  pro- 
duce more  than  one  reference  pattern  for  each  vocabulary  word  if 
the  samples  provided  for  training  are  sufficiently  dissimilar. 

The  reference  pattern  for  a given  word  may  be  derived  from  a 
single  training  sample  in  which  case  the  reference  pattern  is 
identical  to  the  training  sample.  Usually,  however,  two  or  more 
training  samples  go  into  a reference  pattern.  In  this  case  the 
coded  words  for  all  of  the  training  samples  used  to  generate  a 
given  reference  pattern  are  compared  element  by  element  and 
elements  that  are  not  consistent  over  the  set  of  training  samples 
are  masked  out  of  the  reference  pattern. 
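
The masking rule can be illustrated with a short sketch. It assumes that each coded word is a sequence of binary elements and that "consistent" means identical across every training sample; the exact criterion is in Appendix A and may differ. The function name and the (pattern, mask) representation are introduced here for illustration.

```python
from typing import List, Sequence, Tuple


def build_reference(training: Sequence[Sequence[int]]
                    ) -> Tuple[List[int], List[bool]]:
    """Combine coded training samples into a reference pattern and a mask.

    mask[i] is True where element i is active, i.e. where all training
    samples agree; masked-out elements are stored as 0 and ignored later.
    """
    if not training:
        raise ValueError("at least one training sample is required")
    first = training[0]
    if any(len(sample) != len(first) for sample in training):
        raise ValueError("coded words must all have the same length")

    mask = [all(sample[i] == first[i] for sample in training)
            for i in range(len(first))]
    pattern = [first[i] if mask[i] else 0 for i in range(len(first))]
    return pattern, mask


# Three training samples of one word: elements 1 and 3 disagree.
pattern, mask = build_reference([[1, 0, 1, 1],
                                 [1, 1, 1, 0],
                                 [1, 0, 1, 0]])
```

With a single training sample every element is active, so the reference pattern is identical to that sample, as the text states.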

In  the  recognition  process  the  coded  unknown  word  is  compared 
with  all  reference  patterns  and  a score  generated  for  each  com- 
parison. Several  modes  for  comparison  as  well  as  several  modes 
for  scoring  are  available,  as  discussed  in  Appendix  A.  The  word 
recognized  is  that  associated  with  the  reference  pattern  that 
produces  the  highest  score. 

In  the  simplest  combination  of  comparison  and  coding  modes  the 
unknown  and  the  reference  pattern  are  compared  element  by  element 
and  sample  by  sample  over  all  unmasked  elements  of  the  reference 
pattern.  The  score  is  given  by 

SCORE  = 128 (NBA-HD)/NBA 

where NBA is the number of active (i.e. non-masked) elements in
the reference pattern and HD is the Hamming distance between the
unknown and the reference.
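
A minimal sketch of this score, and of choosing the highest-scoring reference, is given below. It assumes binary coded elements and a boolean mask marking the active elements; the function names and the (word, pattern, mask) tuple layout are hypothetical. The score is left as a floating point value, although the printed scores quoted in the report are whole numbers and RESLT3 may round or truncate.

```python
from typing import List, Sequence, Tuple

Reference = Tuple[int, Sequence[int], Sequence[bool]]  # (word number, pattern, mask)


def score(unknown: Sequence[int],
          pattern: Sequence[int],
          mask: Sequence[bool]) -> float:
    """SCORE = 128 (NBA - HD) / NBA over the active elements only."""
    nba = sum(1 for active in mask if active)
    if nba == 0:
        raise ValueError("reference pattern has no active elements")
    # Hamming distance, counted only where the reference element is active.
    hd = sum(1 for u, p, active in zip(unknown, pattern, mask)
             if active and u != p)
    return 128 * (nba - hd) / nba


def recognize(unknown: Sequence[int],
              references: List[Reference]) -> Tuple[int, float]:
    """Return the word number and score of the best-matching reference."""
    best_word, best_score = -1, float("-inf")
    for word, pattern, mask in references:
        s = score(unknown, pattern, mask)
        if s > best_score:
            best_word, best_score = word, s
    return best_word, best_score
```

A perfect match on all active elements scores 128, and because the score is normalized by NBA, heavily masked reference patterns remain comparable with lightly masked ones.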

RESLT3 provides data output as follows: For each utterance the
data provided are the word and the sample numbers, the in-class
score and the maximum out-of-class score, the number of the word
recognized and the scores and word numbers for the five highest
scoring comparisons. For each vocabulary word it provides the
performance index, given by

[equation not legible]

where S_I and S_O are the in- and out-of-class average scores, and
σ_I and σ_O are the standard deviations of the in- and out-of-class
scores  over  all  samples  of  that  word  in  the  file.  For  each  file 
it  provides  a summary  including  the  means  and  standard  deviations 
of  the  in-class  scores  and  the  maximum  out-of-class  scores,  the 
overall  performance  index,  defined  as  before,  the  recognition 
rate,  and  an  error  matrix.  Also  for  each  file  it  provides  a 
header giving the file number for the unknown; either the reference
pattern file number, if the reference patterns were obtained from a
different file than the unknown, or the sequence numbers of the samples
used for training, if obtained from the same file; and identification of
the  modes  used.  A description  of  the  modes  is  contained  in 
Appendix  A. 

An  example  of  the  printout  is  shown  in  Figure  5.  In  this  case 
the  unknown  file  was  WBT45.BIN  and  the  experiment  used  the  first 
five  samples  from  that  file  for  training  and  the  remaining  sam- 
ples for  recognition.  The  segmentation  mode  was  2 and  filters 
1-16  were  used  for  the  cumulative  energy  change  segmentation 
algorithm.  Eight  segments  were  used.  Other  mode  numbers  are 
also  indicated  in  the  header. 

For individual utterances the first two columns are the word and
sample  numbers.  The  next  two  columns  are  the  maximum  scores  for 
in-class  and  out-of-class  comparisons.  The  fifth  column  is  the 
number  of  the  word  recognized,  i.e.  the  word  number  associated 
with  the  maximum  score.  The  next  15  columns  are  the  word 
numbers,  reference  pattern  numbers  and  scores  for  the  five  highest 
scores  obtained  on  the  utterance.  For  example  the  first  entry 
indicates  that  sample  6 of  word  1 was  recognized  as  word  9 with 
a score  of  108  (out  of  a maximum  128)  for  word  9 and  a score  of 
97  for  the  best  in-class  comparison.  The  high  scoring  reference 
pattern  was  number  13,  associated  with  word  9 and  the  next  high- 
est was  number  9,  associated  with  word  5.  In  this  printout  the 
option  to  print  individual  results  for  errors  only  was  selected. 

In  this  test  the  best  performance  index  was  6.06  for  word  number 
9.  After  the  data  for  individual  words,  the  next  four  numbers 
are  the  means  and  standard  deviations  of  the  in-class  and  maximum 
out-of-class  scores  for  the  test.  In  this  case  the  mean  in-class 
score  was  114,  the  mean  of  the  maximum  out-of-class  scores  was 
103  and  the  standard  deviations  were  11  and  10  respectively. 

These  data  are  followed  by  the  overall  performance  index  (1.0488) 
and  the  recognition  rate  (82.26).  Finally  the  error  matrix  shows 
a count  of  words  recognized  versus  words  spoken.  For  example, 
of  the  6 samples  of  word  number  1 spoken,  4 were  recognized 
correctly  and  2 were  recognized  as  word  number  9. 
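
The error matrix and recognition rate described above amount to a tally of (spoken, recognized) pairs. The following is a minimal sketch in Python, assuming the results are available as such pairs; the function names are illustrative and are not taken from the report's software.

```python
from collections import Counter


def error_matrix(results):
    """Count how often each (spoken, recognized) word pair occurs."""
    return Counter(results)


def recognition_rate(results):
    """Percentage of utterances recognized as the word actually spoken."""
    correct = sum(1 for spoken, recognized in results if spoken == recognized)
    return 100.0 * correct / len(results)


# Word 1 as in the example: 6 samples, 4 correct, 2 recognized as word 9.
word1 = [(1, 1)] * 4 + [(1, 9)] * 2
matrix = error_matrix(word1)
print(matrix[(1, 1)], matrix[(1, 9)], recognition_rate(word1))
```

Each row of the printed error matrix corresponds to the counts for one spoken word; the overall recognition rate is the same calculation taken over all words in the file.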

2.3.3  Preprocessing  Routines

During  the  course  of  the  program  several  routines  were  developed 
for  modifying  the  data  in  speech  data  files.  These  routines  read 
data from the original file and create a new file with the modified
data. The order of the words and the word labeling remain
intact. The principal preprocessing routines used were the
breath noise eliminator, BNE1, the modified word boundary test,
BWBT, and the inverse filtering routines SPEQ and SPEQA. The
nature and effects of these routines will be discussed in Section
2.4.


2.3.4  Analysis  and  Plotting  Routines 


Another  set  of  routines  was  used  for  analysis  and/or  plotting  of 
data  from  the  speech  data  files.  These  include: 

• CODEST - Collects statistics on coded data from a speech
data file

• FAPLT - Plots data from a spectral file on the CALCOMP plotter

• FILAV  - Averages  data  from  a speech  data  file  and  writes 
result  in  a spectral  file  suitable  for  plotting  by  FAPLT 

• FILPLT  - Plots  data  from  a speech  data  file  in  the  format 
of  filter  output  vs  time  frame  over  all  16  filters 

• SEGPLT  - Plots  data  from  a speech  data  file  in  the  format 
of  filter  output  averaged  over  each  segment  for  eight 
segments  and  16  filters 

• SPPLTI  - Prints  time-frequency  spectrograms  of  data  from 
a speech  data  file  on  the  line  printer 


2.3.5  File  Editing  Routines


Three routines, FILAS, DATED2, and FILED were used for editing
of  the  speech  data  files.  Each  of  these  routines  leaves  the 
original  file  intact  and  writes  the  edited  file  as  a new  file. 
FILAS  permits  assembly  of  a new  file  from  one  or  more  old  files. 
Words  can  be  selected  by  word  and  sample  number  for  inclusion  in 
the  new  file. 


FILED provides for modification of the header data in words in a
file or deletion of words altogether. The principal use was for
relabeling of words where known errors had been made in the original
data transfer process.

The routine DATED2 provides for editing of the spectral data by
elimination of frames. One command causes an intensity modulated
spectral representation of the word to be displayed on the CRT.
The word can be truncated by chopping off frames from the beginning
or the end, or the word can be eliminated altogether. The
routine was initially developed for manual elimination of breathing
noise from word samples. It later proved valuable for isolating
samples of noise or specific phonemes for further analysis.

2.4  EXPERIMENTS  AND  RESULTS 

This  section  will  describe  the  experiments  performed  on  the  data 
base  and  the  results  obtained.  The  objectives  of  the  experiments 
were  to  determine  what,  if  any,  differences  there  were  in  the 
speech  patterns  taken  at  different  g-force  levels  and  to  improve 
the  overall  recognition  result.  The  experimental  work  was  done 
on  the  DEC-10  computer  using  the  speech  data  files  generated  as 
described  in  Section  2.2. 

2.4.1  Initial  Experiments 

The very first experiments were run on-line as the data were being
collected. The VDETS system, installed as described in Section 2.1,
provided recognition results on the CRT display as the
words were being spoken in the centrifuge gondola. Prior to the
start of the data gathering effort the system was checked out
with a speaker seated in the centrifuge gondola and using the
microphone normally supplied with the system. With the ten-word
vocabulary the recognition rate of VDETS should have been near
perfect, i.e. 99% or better. While no results were formally
recorded, it was apparent that the machine was operating normally.
The speaker could go through the word list at least ten times
without error.

When the input to the machine was switched over to the face mask
microphone that was to be used in the data gathering, the operation
of the system was clearly degraded although, again, no formal
results were recorded. As the data gathering proceeded we
attempted to mark score sheets for the first subject. There were
five errors in five passes for the first 1g run, for a recognition
rate of 90%. This run was made immediately after the training
process so that there was no g-force stress at all. During the
3g run for subject 1, there were 35 errors in 92 words for a
recognition rate of 62%. With the errors coming this frequently,
it was difficult to keep up with the score sheet and for most of
the remaining subjects no scoring was recorded. Subject 7, however,
one of the subjects using the experimental mask as described
in Section 2.1, did somewhat better. For this subject the
rate was 98% for the first 1g run of 50 words and 88% for the 3g
run of 78 words.

When the tapes were received at the SEI facility and run through
the VDETS development support system, similarly poor results were
obtained. It was obvious that the breathing noise was contributing
to the problem. The face mask used fits tightly around the
nose and mouth so that practically all of the subject's air intake
must pass through a supply tube approximately two feet in
length with a diameter of about one inch. Even with the subject
at rest there is a perceptible noise associated with a breath
intake. The situation was worse during the g-force runs where the
subject was somewhat winded as a result of his efforts to "get
on top." The first step to a solution of the breathing noise
problem was to print out spectrograms of some of the data.

Figure  6 shows  one  of  these  with  breathing  noise  clearly  apparent. 
The  next  step  was  to  manually  edit  out  segments  of  breathing 
noise  through  the  use  of  DATED2  as  described  in  Section  2.3.5. 

In  all  cases  the  edited  files  provided  better  recognition  rates 
than  their  unedited  sources.  The  next  step  was  to  automate  the 
editing process.

Figure 6. Spectrogram


Figure 7. Averaged Spectra, Subject 2, 1g

To do this, samples of pure breathing noise were selected from
several speech data files using DATED2. Similarly, samples of
unvoiced fricatives from the words "six" and "seven" were selected.
Averaged spectra for these samples were plotted and compared.
Figure 7 shows such a plot. It was observed that both types
of sound had energy predominantly in the higher frequency filters.
For the breathing noise, however, the energy peaked in filters
12 or 13, while for the fricative sounds the energy peaked in
filter 15. The following algorithm was devised as a test for each
frame of data: eliminate the frame if

[inequality missing in source]

where $E_j$ is the energy in the jth filter. This algorithm was
embodied in preprocessing routine BNE1. When BNE1 was applied
to the same files that manual editing had been applied to, the
results of recognition experiments were within 1% of those
obtained with the manual editing. The threshold TH was set to 10.
The same algorithm was incorporated in the VDETS DSS and the
results were favorable for most subjects, to the extent that
segments of breathing noise that triggered the word boundary light
and were accepted in the machine as words without the breathing
noise eliminator were ignored by the machine with the breathing
noise eliminator installed. On the other hand, in tests where
breathing noise was not present, e.g. live input with the normally
supplied microphone, the breathing noise elimination algorithm
definitely impaired recognition performance.

Over  the  entire  series  of  tests  with  the  centrifuge  data  there 
were  eight  trials  where  all  other  processing  parameters  except 
the  breathing  noise  eliminator  were  held  constant.  In  these 
eight trials the recognition rate with the breathing noise eliminator
was better in six cases, poorer in two. The average rates
were  83.5%  with  the  breathing  noise  eliminator  and  80.3%  without. 


2.4.2  Alternate  Word  Boundary  Test 


In the normal VDETS algorithm each new frame of data is tested in
the following way. The energy change between the new frame and
the previous one, as defined by

$$\sum_{i=1}^{16} \left| E_{i,n} - E_{i,n-1} \right|$$

where $E_{i,n}$ is the energy in the ith filter for the nth time frame,
is computed. If this change is greater than a threshold, the new
sample is retained; otherwise it is dropped. All of the data in
the DEC-10 speech data files were originally processed with this
algorithm. The new word boundary test examined each sample by
computing the following energy sum

[equation missing in source]

and rejecting the sample if this sum was below some threshold.
This algorithm was embodied in the preprocessing routine BWBT.
Over  six  trials  where  BWBT  was  applied  or  not  applied  and  all 
other parameters kept the same, rates were better in five cases
with BWBT and poorer in only one. The average recognition rates
were  80.7%  with  and  76.4%  without. 
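
The normal VDETS frame test described at the start of this section can be sketched as follows. This is an illustration only: it assumes each frame is a list of 16 filter energies, that the first frame is always kept, and that each frame is compared with the frame immediately preceding it in the input. The function names and the threshold value are hypothetical. The energy sum used by BWBT is not reproduced here, so only the energy change test is shown.

```python
def energy_change(frame, previous):
    """Sum over filters of the absolute energy difference between two frames."""
    return sum(abs(a - b) for a, b in zip(frame, previous))


def retain_frames(frames, threshold):
    """Keep a frame only if it differs enough from the frame before it."""
    kept = []
    previous = None
    for frame in frames:
        # Assumption: the first frame has nothing to compare with, so keep it.
        if previous is None or energy_change(frame, previous) > threshold:
            kept.append(frame)
        previous = frame
    return kept


silence = [0] * 16
onset = [5] + [0] * 15
print(len(retain_frames([silence, silence, onset], threshold=3)))
```

A steady sound therefore contributes few frames, while a rapidly changing one contributes many, which is why a separate test on absolute energy can reject low-level samples that this test alone would pass.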

2.4.3  Modification  of  Segmentation  Routine 

In the then-current version of the VDETS algorithm, segmentation
was accomplished by dividing the number of frames in the word by
the number of segments, thus assigning equal numbers of frames
to each segment. An alternate segmentation mode based on energy
change, as described in Appendix A, was made available. Actually
this segmentation algorithm was one that had originally been
used in SEI voice recognition equipment but was abandoned in
favor of the somewhat simpler divide-by-N process. Over 14
trials in which only the segmentation mode was varied, the energy
change segmentation was better in nine cases, poorer in four
cases, and equal in one case. The average recognition rates
over these trials were 86.9% for the equal energy change
segmentation and 85.9% for the divide-by-N mode.
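
The two segmentation modes can be contrasted with a short sketch. The divide-by-N function follows the description above. The energy change function is only one plausible reading of "equal energy change" segmentation, since the actual algorithm is specified in Appendix A and not in this section: it assumes segment boundaries fall where the cumulative frame-to-frame energy change crosses equal fractions of its total. All names are hypothetical.

```python
def divide_by_n(num_frames, num_segments):
    """Segment index per frame, giving each segment a nearly equal frame count."""
    return [i * num_segments // num_frames for i in range(num_frames)]


def energy_change_segments(frames, num_segments):
    """Segment index per frame, splitting on equal shares of cumulative change.

    Assumption: "energy change" is the summed absolute difference between
    consecutive frames, as in the word boundary test.
    """
    changes = [0.0] + [
        sum(abs(a - b) for a, b in zip(current, previous))
        for previous, current in zip(frames, frames[1:])
    ]
    total = sum(changes)
    if total == 0:
        # No spectral change at all: fall back to equal frame counts.
        return divide_by_n(len(frames), num_segments)
    labels = []
    running = 0.0
    for change in changes:
        running += change
        index = int(running / total * num_segments)
        labels.append(min(index, num_segments - 1))
    return labels


frames = [[0], [0], [0], [0], [4], [8]]
print(divide_by_n(len(frames), 2))
print(energy_change_segments(frames, 2))
```

In this toy input the spectrum is static for four frames and then changes quickly, so the energy change mode places the boundary later than divide-by-N does. Note that this simple version can leave a segment empty when the change is concentrated in a single frame.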

2.4.4  Processing  Mode 

As described in Appendix A and Section 2.3, two modes of processing
the raw data were available: the compress-before-coding mode,
process mode 1, and the compress-after-coding mode, process mode 2.

The results showed these modes to be approximately equal. In 48
trials where only the process mode varied, mode 1 won 24 trials,
mode 2 won 17 and there were 7 ties. The average rate was 78.2%
for mode 1 and 80.8% for mode 2.
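
The comparisons in this and the preceding sections all follow the same pattern: pairs of trials that differ in one setting are tallied as wins, losses and ties, and the mean rates are reported alongside. The sketch below shows that bookkeeping and why the two summaries can point in different directions, as they do here. The rates in the example are made up for illustration and are not data from the study.

```python
def compare_paired(trials):
    """Summarize paired trials given as (rate_a, rate_b) tuples.

    Returns (wins_a, wins_b, ties, mean_a, mean_b).
    """
    wins_a = sum(1 for a, b in trials if a > b)
    wins_b = sum(1 for a, b in trials if b > a)
    ties = len(trials) - wins_a - wins_b
    mean_a = sum(a for a, _ in trials) / len(trials)
    mean_b = sum(b for _, b in trials) / len(trials)
    return wins_a, wins_b, ties, mean_a, mean_b


# Hypothetical rates: A wins two trials narrowly, B wins one by a wide margin.
trials = [(81.0, 80.0), (91.0, 90.0), (60.0, 75.0)]
print(compare_paired(trials))
```

A mode can win more trials yet have the lower mean when its losses are larger than its wins, which is consistent with mode 1 winning 24 trials to 17 while averaging 78.2 against 80.8.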

2.4.5  Modification  of  Coding 

Considerable attention was paid to the algorithm by which the
spectral data were reduced to a coded form. The coding algorithms
tested are described in Appendix A. Coding modes 1
through 4 were initially provided in the RESLT3 package.

Mode  4 was  the  mode  employed  in  the  then  current  VDETS  process. 

The  code  is  derived  by  comparing  filter  elements  in  a chain 
fashion,  i.e.  filter  1 versus  filter  2,  filter  2 versus  filter  3, 
etc.  Code  mode  3 was  a modification  of  this  but  in  mode  3 the 
comparisons  are  made  between  independent  pairs  of  spectral  data. 
Filter  1 is  compared  with  filter  2,  filter  3 with  filter  4,  etc., 
to  produce  8 elements  of  the  code.  The  filters  are  combined  in 
pairs  and  the  sum  of  the  outputs  of  filters  1 and  2 is  compared 
with  the  sum  of  the  outputs  of  3 and  4,  etc.  to  produce  four  more 
elements.  The  process  is  continued  with  filters  summed  together 
in  groups  of  four  and  then  eight  to  produce  15  elements  in  all. 
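
The two binary coding schemes can be sketched as follows. The comparison structure follows the description above; the bit polarity (1 when the first quantity is the larger) and the function names are assumptions made for illustration.

```python
def chain_code(filters):
    """Mode 4 style: compare filter 1 with 2, 2 with 3, and so on."""
    return [1 if a > b else 0 for a, b in zip(filters, filters[1:])]


def independent_code(filters):
    """Mode 3 style: compare disjoint pairs, then pairs of sums, and so on.

    For 16 filters this yields 8 + 4 + 2 + 1 = 15 code elements.
    """
    bits = []
    level = list(filters)
    while len(level) > 1:
        pairs = [(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        bits += [1 if a > b else 0 for a, b in pairs]
        # Sum each pair to form the groups compared at the next level.
        level = [a + b for a, b in pairs]
    return bits


spectrum = [3, 5, 9, 7, 4, 2, 1, 1, 2, 6, 8, 5, 3, 2, 1, 0]
print(len(chain_code(spectrum)), len(independent_code(spectrum)))
```

Both functions produce 15 elements from 16 filter outputs, but in the chain code every interior filter takes part in two comparisons, whereas in the independent code each comparison at a given level uses filters that no other comparison at that level touches.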

The number of code elements in the chain coding process of
mode 4 is the same as in the independent coding process of mode
3. Note, however, that if the output of filter 2, for example,
is greater than that of filter 1, then it is more likely that the
output of filter 2 will be greater than that of filter 3 than it
would have been if the output of 2 had been less than that of 1.
Hence the elements of the code in the chain process are not
statistically independent. The independent coding process therefore
retains more information about the spectrum from which it was
derived than the chain coding process and hence should be more
effective in the speech recognition process.

In both modes 3 and 4 the code elements are binary and the coded
representation of a spectral frame or segment requires only 15
bits. The training process produces a set of 15-bit coded
segments for each reference pattern, but it also requires an equal
number of bits to specify the masking function as discussed in
Section 2.3.2. Since each element of the reference pattern
requires 2 bits of storage, it seemed that a 3-level coding scheme
might be more effective than the binary one while requiring no
more storage for the reference patterns.

In the 3-level scheme, spectral samples A and B are compared to
produce a 3-level code element as shown in Figure 8. If the
ratio A/B is greater than T, the coded output is 10; if the ratio
is less than 1/T the output is 01; otherwise the output is 00.

The threshold can have any value, but the coding process is
computationally simple if T is of the form $(2^n + 1)/(2^n - 1)$.

In this case the coding requires only the comparisons A versus B
and $2^n |A - B|$ versus $|A + B|$, since $A/B > T$ is equivalent to
$2^n (A - B) > A + B$. The only multiplication required is an
n-bit left shift.
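
The shift-based comparison can be sketched as below, reading the threshold as T = (2^n + 1)/(2^n - 1), which reproduces the values 9/7 (n = 3) and 17/15 (n = 4) quoted in this report. The sketch assumes non-negative integer filter energies; the function name is hypothetical.

```python
def three_level(a, b, n):
    """Return "10" if a/b > T, "01" if a/b < 1/T, else "00".

    T = (2**n + 1) / (2**n - 1); a/b > T is tested without division as
    2**n * (a - b) > a + b, using an n-bit left shift.
    """
    if a > b and ((a - b) << n) > a + b:
        return "10"
    if b > a and ((b - a) << n) > a + b:
        return "01"
    return "00"


# With n = 3 the threshold is 9/7.
print(three_level(10, 7, 3), three_level(7, 10, 3), three_level(9, 7, 3))
```

The ratio 9/7 itself falls in the middle band because the test is a strict inequality. Lowering the threshold to 17/15 (n = 4) narrows that band, so the same pair (9, 7) is then coded as 10.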

Three-level coding modes 1 and 2 were implemented. Mode 1 employed
independent comparisons analogous to mode 3 while mode 2
employed chain type comparisons as in mode 4. In both cases the
threshold was 9/7. Subsequently, mode 6 was added, being identical
to mode 1 except that the threshold is 17/15.

2.4.6  Feature  Evaluation 

The coding processes described so far produce patterns of 120
elements grouped into eight segments of 15 elements each. The
training process is designed to eliminate from the reference
pattern any of these elements that are inconsistent over the
training set and hence presumed to be of relatively little value in
the recognition process. A study was undertaken to investigate
further the relative value of the elements in the code. It seemed
likely that not all of the elements were of equal usefulness
in the word recognition process and possible that the useful ones
might vary with the g-level.

To  carry  out  this  investigation  a routine,  CODEST,  was  developed. 

This routine compares a set of test utterances with a set of
reference patterns and determines the rates of occurrence of
Hamming distances zero, one and two for both in-class and
out-of-class comparisons. The data are averaged over all segments of
the word patterns. The results then are estimates of the
probabilities of obtaining Hamming distances of zero, one or two
for each of the 15 elements of the code for in-class and
out-of-class reference patterns. A good feature for the recognition
process would be one with a high probability of zero Hamming
distance in-class and a low probability of zero Hamming distance
out-of-class. A measure of this type of effectiveness is the
Bhattacharyya distance¹ defined in the general case by

$$B(S_1, S_2) = -\ln \int \left[ P(x \mid S_1)\, P(x \mid S_2) \right]^{1/2} dx$$

where $S_1$ and $S_2$ are classes and $P(x \mid S_i)$ is the conditional
probability density of obtaining feature x when the sample belongs to
class i. For the speech tape data the Bhattacharyya distance takes
the form

$$B = -\ln \sum_{i=0}^{2} \left[ P_{\mathrm{in}}(i)\, P_{\mathrm{out}}(i) \right]^{1/2}$$

where $P_{\mathrm{in}}(i)$ is the probability of obtaining a Hamming distance
of i for an in-class comparison and $P_{\mathrm{out}}(i)$ is the similar
probability for an out-of-class comparison.
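
For a single code element, the discrete form of the distance is a short calculation over the three Hamming distance probabilities. The sketch below assumes each distribution is given as a list of probabilities for distances 0, 1 and 2; the function name and the example probabilities are illustrative and are not measurements from the study.

```python
import math


def bhattacharyya(p_in, p_out):
    """Discrete Bhattacharyya distance between two probability lists."""
    coefficient = sum(math.sqrt(p * q) for p, q in zip(p_in, p_out))
    # No overlap at all means the classes are perfectly separable.
    return math.inf if coefficient == 0 else -math.log(coefficient)


# Hypothetical element: usually matches in-class, usually differs out-of-class.
good = bhattacharyya([0.9, 0.1, 0.0], [0.1, 0.1, 0.8])
# Hypothetical element: same distribution in-class and out-of-class.
poor = bhattacharyya([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
print(good, poor)
```

An element whose in-class and out-of-class distributions coincide gives a distance near zero and carries no information for recognition, while a larger distance indicates a more useful element.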


Measurements were made for several speakers from the centrifuge
experiments and from live inputs. The study revealed that some
elements were much better than others and that some were almost
worthless (B approaching zero). There was generally good agreement
in the poorer elements over all speakers and at all g-levels.
A scoring mode (mode 3) was added that permitted weighting
the code elements unequally. Several tests were made with
this approach and with various weight assignments. This approach
to taking advantage of the relative efficacy of the various
elements was abandoned in favor of the approach described below.

It was found that the code elements that result from the comparison
of filter 13 with 14 and filter 15 with 16 had generally
small Bhattacharyya distances while the code elements resulting
from the comparison of filters 1+2 with 3+4 and 5+6 with 7+8 had
relatively large Bhattacharyya distances. Accordingly a code mode
9 was added to the repertoire in which the two elements with low
Bhattacharyya distances were replaced with elements obtained by
comparing filters 2+3 with 4+5 and 3+4 with 5+6. Also, since the
system runs on a 16-bit machine and a sixteenth code element is
therefore available free, as it were, an additional element obtained
from comparison of filters 4+5 with 6+7 was added in the sixteenth
position. Code mode 9 is otherwise the same as mode 1, i.e. it
uses 3-level comparison with a threshold of 9/7. Code mode 10
was also added, being the same as mode 9 except with a 17/15
threshold.

2.4.7  Evaluation  of  Coding  Modes 

An evaluation of the relative effectiveness of coding modes 1
through 4, 6, 9 and 10 was made in the following manner. From
all of the recognition experiments run, sets were selected where
only the coding mode varied, all other parameters being held
constant. The selection of such sets was comprehensive. From these
sets, subsets were selected for each coding mode.

The result was a series of trials for each mode with that mode
pitted against one of the other modes in an equal contest, i.e.
one where all other parameters were the same. For each mode a
[...] which is three-level chain type coding, was the poorest of all.
Modes 1 and 6, where the only difference is the comparison
threshold, 9/7 versus 17/15, had approximately equal records, but
between modes 9 and 10, where again the only difference is the
comparison threshold, mode 10 was clearly superior. Mode 10, in
fact, has the best performance by far.

2.4.8  Other  Coding  Modes 

Several other coding algorithms were tried with results that
seem to be significant to the general speech recognition problem.
All of the coding modes described so far produce outputs that
are sensitive to spectral slopes, through comparison of the
energy in adjacent sections of the speech power density spectrum.
Other types of coding, not based on slopes, were tried. In these
modes the code elements were derived by comparing the filter output
in each spectral frame with the average filter output over
the utterance. Several variations were tried. With processing
mode 2, i.e. code-before-compress, it was possible to calculate
the average spectrum over the utterance and then to go
back and use this in the coding process. Each element of the
code, then, was generated by comparing a filter output for the
frame with the average output of that filter for that utterance.
Two-level and three-level codes were tested. Where processing
mode 1 was used the average spectrum was normalized, as was the
average spectrum for each segment, before applying the coding
process. Both peak normalizing and average normalizing were
tried.

These  coding  techniques  are  sensitive  to  spectral  shape  but  are 
almost  completely  insensitive  to  spectral  slopes.  In  all  cases 
tested  the  recognition  performance  was  quite  poor  in  comparison 
with  the  slope  sensitive  coding  modes.  Differential  rates  of 
20  to  25%  were  common. 


2.4.9  Inverse  Filtering


Recognition tests with the available repertoire of processing
modes revealed a marked degradation in the capability of any of
the algorithms with increasing g-level. For example, Appendix B
shows typical recognition rates for subject 2 at 1g, 3g and 5g.

An investigation was made to determine why this was so, and if
there were any characteristic changes in the speech pattern with
g-level.

The general method employed in the investigation was the comparison
of spectral plots at the various g-levels. One approach was
to average the response characteristic over one repetition of
the word list. Figures 9 through 14 illustrate the results for
several subjects. In each case the composite spectra at 1g, 3g
and 5g are plotted. For comparison Figure 15 shows the composite
spectra for 3 separated repetitions of the word list at 1g for
subject 2. (The spectra were peak normalized in all cases.)

While there is a marked change in the spectral characteristics
at different g-levels for all subjects, there is no apparent
pattern to these changes. In most cases, the upper peak in the
spectrum seemed to shift down by at least one filter from 1g to
3g and 5g. For subject 6, however, the reverse occurred and the
upper peak shifted upward for the higher g-levels. For subject
5 the peak occurred at the same place at all three levels. The
relative amplitude of the peak in the lower end of the spectrum
varied considerably with g-level although the peak itself generally
remained in the same filter for a given subject. Again
there was no consistency in the direction of the changes.

Another approach was to compare the compressed word pattern for
individual utterances at different g-levels. Figures 16 through
20 illustrate the results. These plots show the spectral response
versus filter number for each of the eight segments. The
words are processed by segmentation mode 2 and processing mode 1.
Figures 16 and 17 are two repetitions of the word "four" for
subject 3 at 1g. Figure 18 is one repetition of the same word
at 5g. Note the shift and broadening of the higher frequency
peak in segments 6, 7 and 8 at 5g. Figures 19 and 20 show the
word "eight" at 1g and 5g respectively. Note the downshift of
the upper peak through most of the segments.

Figure 10. Averaged Spectra; Subject 2; 1, 3, 5g

Figure 11. Averaged Spectra; Subject 3; 1, 3, 5g

Figure 12. Averaged Spectra; Subject 5; 1, 3, 5g

Figure 13. Averaged Spectra; Subject 6; 1, 3, 5g

Figure 14. Averaged Spectra; Subject 8; 1, 3, 5g

Figure 15. Averaged Spectra; Subject 2; 1g

Figure 17. Spectral Response vs Filter No.,

Figure 18. Spectral Response vs Filter No., "Four," 5g

Similar curves were plotted for other words and subjects with
similar results. The patterns were more consistent at 1g than
at the high g-levels, and at the high g-levels there were changes
in the patterns marked by broadening and shifting of peaks.

A tentative explanation for the variation of the spectral
characteristics of the speech samples with g-level follows. The
mechanism for the production of speech sounds can be represented
as shown in Figure 21. A signal source supplies either pulsed
or noise-like signals to a filter structure consisting of the
oral or nasal cavities. In the frequency domain we may represent
the spectrum of the source as $S(\omega)$ and that of the vocal cavities
as $H(\omega)$. Both S and H are varied by the speaker in producing
different sounds. Where the speech is to be processed
electronically, and specifically in the word recognition device,
we must add a third characteristic $G(\omega)$ in the chain, where G is
the spectral response of the microphone and the amplifier circuits
used. The spectral response measured is the product of
the three functions, $S(\omega)$, $H(\omega)$, and $G(\omega)$.

In most word recognition applications S and H vary characteristically
with the words spoken while G remains constant. Any
variation of G that significantly modifies the spectral
characteristic might be expected to have an adverse effect on the word
recognition process. In the centrifuge experiments a mechanism
exists for the variation of G with g-level. All of the data
were taken with the subject wearing a face mask with a built-in
microphone. The mask fits tightly around the nose and mouth and
creates a resonant cavity that will significantly modify the
spectral characteristics of the speaker's voice.

Figure 19. Spectral Response vs Filter No., "Eight," 1g

Now if the mask remains stationary, then the function $G(\omega)$,
although significantly different from that which would have
been obtained without the mask, at least remains fixed, and
hence does not interfere with the recognition process. There
are two mechanisms that could cause a change in the response
function introduced by the mask, however. First, if the mask
is removed and then replaced it may be put in a somewhat different
position; and second, the effect of g-force stress can cause
the mask to slip or can otherwise distort its shape. Both of
these mechanisms were present. In most cases the subject
put the mask on and left it in position through the training
passes and the first 1g run. In most cases after the g-force
runs, the mask had either slipped or had become uncomfortable
and the subject had to readjust it. In many cases he removed
the mask and later replaced it.

If we knew what the function $G(\omega)$ was we could eliminate the
effects of its changing by multiplying the composite characteristic
by the inverse $1/G(\omega)$. We do not know and have no way of
measuring G and its changes. We can, however, average the
response over one or more repetitions of the word list to
obtain the average response

$$W = \overline{S(\omega)\, H(\omega)\, G(\omega)}$$

where the bar denotes the average. If the average is made over
conditions, i.e. constant g-level, where $G(\omega)$ might be expected
to remain constant, then we may remove the bar from over $G(\omega)$.
If we then use the inverse of W to filter the instantaneous
spectral response we have the modified spectral function for the
speech data

$$\frac{S(\omega)\, H(\omega)}{\overline{S(\omega)\, H(\omega)}}$$

This function may or may not be as good as the original function
for recognition purposes, but at least the effects of variation
of the reproduction channel have been removed.

A preprocessing  routine  SPEQ  was  written  to  apply  the  inverse 
filtering  described  above  to  speech  data  files.  SPEQ  goes 
through  the  file  and  takes  the  first  occurrence  of  each  word  in 
the  vocabulary  and  computes  the  spectral  average  over  this  set 
of utterances. The routine then multiplies all spectral data
in the entire file by the inverse of the average spectrum thus
computed.

A variant of this was the routine SPEQA. Here the average
spectrum is computed as in SPEQ over the set of first utterances
of all words in the vocabulary. SPEQA then goes back and applies
the inverse filter to all spectra, as in SPEQ, but it also
updates the average as it goes: the spectrum of the word just
processed is combined with the average spectrum in the ratio of
(1/8):(7/8) for the new and old spectra respectively.
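
The running update used by SPEQA can be sketched as below. This is
an illustration under stated assumptions: the initial average is
taken as given (computed as in SPEQ), each word is filtered with
the average in force before that word is folded in, and the
"spectrum of the word" is taken to be the mean of its frames. The
source does not specify these details; the function names are
hypothetical.

```python
def update_average(average, word_spectrum, new_weight=1.0 / 8.0):
    """Blend old and new spectra in the ratio (7/8):(1/8)."""
    old_weight = 1.0 - new_weight
    return [old_weight * a + new_weight * s
            for a, s in zip(average, word_spectrum)]


def speqa(utterances, initial_average, floor=1e-9):
    """utterances: list of (word, frames); returns (filtered, final_average)."""
    average = list(initial_average)
    filtered = []
    for word, frames in utterances:
        gains = [1.0 / max(a, floor) for a in average]
        filtered.append(
            (word, [[x * g for x, g in zip(frame, gains)] for frame in frames]))
        # Mean spectrum of the word just processed (assumed definition).
        count = len(frames)
        word_spectrum = [sum(f[k] for f in frames) / count
                         for k in range(len(average))]
        average = update_average(average, word_spectrum)
    return filtered, average
```

With a 1/8 weight on each new word, the average follows slow
changes in the channel while remaining insensitive to the spectrum
of any single word.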

Results for the inverse filtering process were significant. In
63 trials where recognition performance with either SPEQ or
SPEQA was compared with performance without either of these
being applied but with all other parameters identical,
recognition was better with inverse filtering in 45 cases, poorer
in 11, and equal in 7. The mean rates were 75.6% with inverse
filtering and 70.3% without. In 30 trials at higher g-levels,
i.e. greater than 1g, inverse filtering produced better
recognition in 23 cases, poorer in only 1, with 6 ties. The mean
rates were 64.2% with filtering and 57.6% without. In cases where
SPEQ and SPEQA were both applied, their effectiveness was about
equal. These data are presented in somewhat different form in the
bar chart of Figure 22.
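
The report gives only better/poorer/tie counts and does not state
how significance was judged. As a rough check, and assuming the
trials can be treated as independent (which may not hold, since
they share subjects and data files), a two-sided sign test on the
non-tied trials can be computed as sketched here. The function
name `sign_test_p` is hypothetical.

```python
from math import comb


def sign_test_p(wins, losses):
    """Two-sided sign-test p-value; ties are excluded by the caller."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of k or more successes in n fair coin flips.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


p_all_levels = sign_test_p(45, 11)   # 63 trials, 7 ties excluded
p_high_g = sign_test_p(23, 1)        # 30 trials, 6 ties excluded
```

Under the independence assumption both p-values are very small,
which is consistent with the statement that the improvement was
significant.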

It is obvious from the recognition rates quoted above that
inverse filtering is not the panacea that will make the
recognition performance under g-force stress with a face mask
equal to that obtainable under normal conditions. There are
several possible reasons why this is not so. First, the spectral
function $S(\omega)H(\omega)/\overline{S(\omega)H(\omega)}$ simply may
not be as good as the unmodified function for recognition
purposes. Second, the assumption that $G(\omega)$ remains constant
over a test run at a particular g-level may not be valid, in that
the mask might slip or move suddenly at any time during the run.
Finally, it may be that modification of the system response
function caused by changes in the shape or position of the face
mask is not the whole story. Certainly in the process of "getting
on top," the subject exerts some effort and becomes somewhat
winded. In almost all cases the second 1g run showed
significantly poorer recognition rates than the first one. This
run was usually made immediately after the 5g run while the
subject was resting. He was usually winded at the beginning of
this run, in addition to having possibly removed and replaced the
face mask. It might be expected under these conditions that more
errors would be made during the first part of the second 1g run
than during the later parts, in that the subject was regaining
his breath during the whole period. No such effect was observed.

2.4.10  Multimode  Training 

The  original  training  procedure  is  described  in  Section  2.1. 
During  this  procedure  only  one  reference  pattern  is  maintained 
for  each  word  of  the  vocabulary.  For  each  new  sample  this 
reference  pattern  is  modified  only  by  modifying  the  masking 
function  to  reflect  any  new  nonconsistent  elements.  This  mode 
of  training  is  normally  satisfactory  for  word  recognition 
applications.  It  does  have  the  weakness,  however,  that  a bad 
sample  can  cause  the  masking  of  an  undue  number  of  elements  in 
the  reference  pattern  for  a single  word.  Even  without  bad  samples, 
however,  words  may  be  spoken  in  more  than  one  mode  and  all  modes 
get  lumped  together  in  a single  reference  pattern. 

A scheme for multimode training, i.e. where each vocabulary word
may have more than one reference pattern, was implemented. The
process operates as follows. The coded versions of all samples
of all words in the training set are maintained in storage until
the Hamming distance matrix is computed. This matrix contains
the distance between each sample and every other sample. A
threshold distance is set and the training samples are grouped
together such that the Hamming distances between all members of
a set are less than or equal to the threshold distance. Enough
sets are chosen so that each sample is represented in at least
one set. Some samples may appear in more than one set, however.
Once the members of the training set have been placed into
subsets within the threshold distance, the training procedure
for each subset is carried out as before.

A reference pattern is generated with all nonconsistent elements
masked out. Note that if the threshold distance is made larger
than the Hamming distance could possibly be, then all of the
training samples will be lumped together, as in the original
single-mode process.
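
The grouping step can be sketched as follows. The report does not
say how the subsets are chosen, so this sketch assumes a simple
greedy rule applied to the samples of one vocabulary word: every
sample not yet covered seeds a new subset, and any other sample is
added if it lies within the threshold of every member already in
that subset. Coded samples are represented as equal-length bit
strings. The names `hamming`, `group_samples` and
`train_reference` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def group_samples(samples, threshold):
    """Cover all samples with subsets whose pairwise distances <= threshold."""
    n = len(samples)
    dist = [[hamming(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]
    groups, covered = [], set()
    for seed in range(n):
        if seed in covered:
            continue
        group = [seed]
        for cand in range(n):
            # A sample may join more than one subset, as the text allows.
            if cand != seed and all(dist[cand][m] <= threshold for m in group):
                group.append(cand)
        groups.append(sorted(group))
        covered.update(group)
    return groups


def train_reference(samples, group):
    """Reference code plus mask; mask is False where members disagree."""
    first = samples[group[0]]
    mask = [all(samples[i][k] == first[k] for i in group)
            for k in range(len(first))]
    return first, mask
```

Calling `group_samples(samples, float("inf"))` returns a single
subset containing every sample, which corresponds to the
single-mode case noted above.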

Multimode training was found to be effective. The parameter RPD
(reference pattern distance) is the threshold value. A value of
50 or 60 for this parameter was found to be effective for
multimode training. For single mode training, the value for RPD
is noted as INF. In 30 trials where RPD = INF was compared with
RPD = 50 or 60 and all other parameters remained fixed, multimode
training provided an average recognition rate of 85.5% versus
80.8% for the single mode training cases. Of the 30 trials, the
recognition rate was better in 18 with multimode, better in only
3 with single mode, and there were 9 ties.

2.4.11  Time  Warp  Comparison 

At the beginning of the program it was believed that considerable
benefit to the recognition process might be achieved by the use
of a time warp comparison of the unknown speech pattern with the
reference pattern. Such techniques have been successfully used
by others. Accordingly, software was developed to implement the
time warp comparison by dynamic programming as described by
Itakura. Listings of the routines involved are contained
with the software package under the title "FPATH."

Although FPATH is applicable to a range of M x N element
comparisons, it has been used on this program only for comparison
of 8-segment unknowns with 8-segment reference patterns. Time
warp comparison is implemented in COMP mode 1 of RESLT3. When
this mode is selected, the time warp comparison is invoked both
for training and recognition.

In  the  training  mode  the  time-warp  algorithm  is  used  to  align 
the  reference  pattern  with  each  new  sample  going  into  the  train- 
ing base.  The  comparison  is  done  so  that  each  segment  of  the 
reference  pattern  is  used  while  some  segments  of  the  new  train- 
ing sample  may  be  used  more  than  once  and  others  not  used.  Only 
if  the  optimum  path  turns  out  to  be  straight  will  all  segments 
of  the  new  sample  be  used  once  and  only  once.  Once  the  order  of 
comparison  has  been  established,  then  the  elements  of  the  code 
are  compared  for  matching  segments  of  the  reference  and  the  new 
samples,  and  inconsistent  segments  of  the  reference  pattern  are 
masked  as  in  normal  training.  In  the  recognition  mode  the  time- 
warp  algorithm  is  used  to  find  the  minimum  distance  path  and  the 
corresponding distance is used in computing the score.

Results with the time-warp algorithm were disappointing. In nine
trials where it was compared with the straight comparison mode,
it provided a better recognition rate in only one case, and there
the difference was less than 1%. In other trials the straight
comparison mode produced better results by 3% to 5%. Further
tests with the time-warp mode were abandoned.
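
The time warp comparison described in this section can be
illustrated with a small dynamic-programming sketch. It is not the
FPATH routine. It assumes that each reference segment is matched
to exactly one sample segment, that the matched sample index never
decreases and advances by at most two per reference segment, that
the first and last segments are pinned together, and that the
local distance is the Hamming distance between segment codes.
Itakura's additional slope constraints are omitted. The names
`time_warp` and `max_step` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def time_warp(reference, sample, max_step=2):
    """Return (minimum distance, sample index used for each reference segment)."""
    n, m = len(reference), len(sample)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[-1] * m for _ in range(n)]
    cost[0][0] = hamming(reference[0], sample[0])
    for j in range(1, n):
        for i in range(m):
            best_prev, best_cost = -1, inf
            for step in range(max_step + 1):   # step 0 repeats a sample segment
                p = i - step
                if p >= 0 and cost[j - 1][p] < best_cost:
                    best_prev, best_cost = p, cost[j - 1][p]
            if best_prev >= 0:
                cost[j][i] = best_cost + hamming(reference[j], sample[i])
                back[j][i] = best_prev
    if cost[n - 1][m - 1] == inf:
        return inf, []                         # no admissible path
    path = [m - 1]
    for j in range(n - 1, 0, -1):
        path.append(back[j][path[-1]])
    path.reverse()
    return cost[n - 1][m - 1], path
```

For identical patterns the optimum path is straight and every
sample segment is used once; when the sample contains an extra
segment, the path skips or repeats sample segments while each
reference segment is still used exactly once.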

2.4.12  Summary  of  Results 

A table with a comprehensive listing of the results of all the
recognition experiments performed on the DEC-10 system is given
in Appendix B. The records in this table are sorted by subject
number,  g-force  level  for  the  recognition  file,  the  file  from 
which  training  data  were  taken,  the  number  of  samples  and  sequence 
numbers  of  the  training  data,  and  the  recognition  rate. 

Table III lists the best recognition scores obtained for all
subjects by g-level. Also shown are the results obtained with the
original algorithm, i.e. the VDETS algorithm used in the machine
at the time the data were taken. This table indicates the
progress made in improving the algorithm for recognition under
g-force stress. The bar chart of Figure 23 summarizes the
performance versus g-level over all subjects.

TABLE III. BEST RATES AND ORIGINAL RATES

7G

ORIG  BEST 

■ 

83.7 

91.8 

- 

94.6 

47.0 

56.6 

71.6 

84.2 

40 

89 

100 

64.6 

93.7 

65.8 

88.6 

70.5 

91.0 

■ 

- 

92.1 

- 

80.8 

- 

41.8 

- 

69.7 

33 

■ 

78.1 

90.4 

38.0 

79.8 

25.0 

51.6 

- 

5 

- 

100 

96.4 

- 

75.9 

- 

88.2 

100 

6 

98 

100 

79.3 

38.9 

90.2 

7 

- 

95.6 

- 

98.7 

- 

- 

98.0 

8 

- 

85.5 

- 

81.7 

- 

31.8 

- 

76.3 

9 

72.6 

88.7 

53.7 

73.2 

12.7 

36.7 



38.6 

54.2

3. CONCLUSIONS

The objectives of this program were to investigate the effects
of g-force stress on the subjects' vocal patterns as applicable
to the word recognition process in isolated word recognition
systems, and to find a means for making such a system work under
the adverse conditions of g-force stress. The investigators
were not certain at the outset that g-force stress would cause
any significant difference in the voice pattern or in the ability
of a word recognition device to function with acceptable accuracy.
Any doubts on this score were soon dispelled after the initial
results from data taken on the human centrifuge at the Brooks
Air Force Base were observed. Recognition rates as shown in
Table III were well below the acceptable level of 98% and
decreased with increasing g-force stress. The rates, however,
were below the acceptable level even for the nonstress, i.e. 1g,
case. This indicated that something more than g-force stress was
at work.

Through  a number  of  experiments  on  the  data,  testing  various 
modifications  of  the  recognition  process,  it  was  found  to  be 
possible  to  improve  the  recognition  performance,  in  many  cases 
markedly.  Modest  improvements  were  obtained  by  modifying  the 
word  segmentation  and  coding  process.  Significant  improvements 
were obtained by the use of a multimode training process,
inverse filtering and breath noise elimination, as shown in Table
IV. With these measures it was possible for some subjects to
bring the recognition rates at 1g and 3g up to an acceptable
level. A general solution, however, was not found.

It may certainly be concluded that putting on a face mask and
riding in a centrifuge at g-force levels of 3g and higher causes
modifications in human voice patterns. These modifications had
an adverse effect on the capability of the SEI VDETS word
recognition device and would undoubtedly have similar effects on
other word recognition machines. There is strong evidence that
the face mask is largely responsible for the voice pattern
variations with g-force stress. The face mask interposes a
frequency dependent characteristic in the transmission path. This
characteristic varies with the positioning and shape of the mask,
both of which can change with g-force stress. Inverse filtering
is partially successful in counteracting the effects of the mask.


TABLE IV. RECOGNITION RATE IMPROVEMENTS

Process                        Average % Improvement

Breath Noise Eliminator                 3.2
Inverse Filtering
  (1, 3 & 5 g-levels)                   5.3
  (3 & 5 g-levels)                      6.6
Multimode Training                      4.7

We  believe  that  the  key  to  successful  operation  of  voice  recogni- 
tion devices  in  the  cockpits  of  fighter  aircraft  is  the  elimina- 
tion of  the  effects  of  the  face  mask.  Accordingly,  for  continu- 
ation of  the  investigation,  we  recommend  that  this  be  given 
primary  attention.  It  would  be  possible  to  eliminate  the  face 
mask  entirely  by  means  of  some  device  such  as  a throat  microphone. 
Also  it  might  be  possible  to  redesign  the  mask  in  some  way  so 
that  it  has  less  effect  on  the  acoustic  transmission  path.  From 
a practical  standpoint,  neither  of  these  solutions  is  applicable; 
some  processing  technique  must  be  found  to  eliminate  the  effects 
of the mask from the signal.

The inverse filtering process investigated in this program was a
crude  attempt  to  eliminate  face  mask  effects.  It  was  partially 
successful,  but  a more  sophisticated  approach  is  required.  Some 
form  of  inverse  filtering  is  a possibility.  Another  possibility 
is  to  determine  those  portions  of  the  spectrum  most  affected  by 
the  face  mask  and  eliminate  them  from  the  recognition  process. 

Finally  it  has  been  demonstrated  that  the  reflection  coefficients 
derived  from  LPC  analysis  can  be  applied  directly  to  model  the 
vocal tract. The face mask cavity can be considered as an
extension  of  the  vocal  tract,  and  the  speech  signal  can  be  repre- 
sented in  terms  of  reflection  coefficients.  It  is  possible, 
therefore,  that  such  a representation  can  be  used  for  recognition 
purposes  and  that  the  face  mask  effects  can  readily  be  eliminated 
by  separating  out  those  coefficients  associated  with  the  portion 
of the extended vocal tract represented by the mask.

APPENDIX B

RECOGNITION  RESULTS  SUMMARY 

The  following  table  presents  the  results  of  all  recognition 
experiments  conducted  on  the  program.  An  explanation  of  the 
columns is given in Table B-1.


TABLE B-1. KEY TO RESULTS DATA

Col. No.  Heading          Description

1                          Record identification number
2         SUBJ             Subject number (See Table II)
3         REC              File number for recognition data
                           (See Table B-2 for key)
4         G                G-force level of file in Col. 3
5         TRAIN            File number for training data
6         G                G-force level of file in Col. 5
7         PASS             Number of training passes
8         SEQ              Sequence numbers of training samples
9         SPRD             Period (in milliseconds) of sampling clock
10        PROC             Processing mode (See Appendix A)
11        CODE             Code mode (See Appendix A)
12        COMP             Comparison mode (See Appendix A)
13        SEGM             Segmentation mode (See Appendix A)
14        NSEG             Number of segments
15        FLTS             Filters used in energy change
                           segmentation (See Appendix A)
16        SCR              Scoring mode (See Appendix A)
17-19     PREPROC-1,-2,-3  Preprocessing applied to data file
                           (See Sec. 2.3.1.2)
20        REF PAT DIS      Threshold for separation of reference
                           patterns in multimode training
                           (See Appendix A)
21        MN IN            Mean in-class score
22        MAX OUT          Mean of the largest out-of-class score
23        PI               Performance index (See Sec. 2.3.1.1)
24        RATE             Recognition rate


TABLE B-2. KEY FOR FILE NUMBERS

TRF - Original files

TRE - Same as TRF file of same number but edited to correct
      mislabeled words

TAF - Several TRF files of the same g-force level combined

COL - Several files of same subject but different g-levels
      combined

NRF - Original file, 2nd sampling. NRF and TRF files with the
      same number are not necessarily for the same original data

BED - Manually edited for breath noise elimination


MISSION
of
Rome Air Development Center

RADC plans and conducts research, exploratory and advanced
development programs in command, control, and communications
(C³) activities, and in the C³ areas of information sciences
and intelligence. The principal technical mission areas
are communications, electromagnetic guidance and control,
surveillance of ground and aerospace objects, intelligence
data collection and handling, information system technology,
ionospheric propagation, solid state sciences, microwave
physics and electronic reliability, maintainability and
compatibility.


